extract(hpc): scitex.hpc → scitex-hpc v0.1.0 (generic SLURM dispatch) #258

Open
ywatanabe1989 wants to merge 10 commits into develop from extract/scitex-hpc

Conversation

@ywatanabe1989
Owner

Summary

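Extract the SLURM dispatch logic from scitex_dev.test_runner into a standalone scitex-hpc v0.1.0 package. The package is bridged into the umbrella namespace via a sys.modules alias, so scitex.hpc and scitex_hpc resolve to the same module object.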
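For readers unfamiliar with the aliasing technique, here is a minimal sketch of how a sys.modules alias can make an umbrella name and a standalone package resolve to one module object. The module names scitex_demo and scitex_hpc_demo are hypothetical stand-ins, and the actual wiring in scitex may differ.

```python
import sys
import types

# Hypothetical stand-in for the standalone package (not the real scitex_hpc).
standalone = types.ModuleType("scitex_hpc_demo")
standalone.srun = lambda command: f"srun {command}"
sys.modules["scitex_hpc_demo"] = standalone

# Hypothetical stand-in for the umbrella package.
umbrella = types.ModuleType("scitex_demo")
umbrella.__path__ = []  # an (empty) __path__ marks the module as a package
sys.modules["scitex_demo"] = umbrella

# The alias itself: register the same object under the dotted umbrella name
# and expose it as an attribute so `import scitex_demo.hpc as h` resolves.
sys.modules["scitex_demo.hpc"] = standalone
umbrella.hpc = standalone

import scitex_demo.hpc as h
import scitex_hpc_demo as r

assert h is r  # both names point at one module object
```

Because both names share one object, state and monkeypatches applied through one name are visible through the other.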

API

from scitex.hpc import JobConfig, srun, sbatch, sync, poll_job, fetch_result
  • JobConfig dataclass with SCITEX_HPC_* env-var override resolution
  • 5 functions covering the full dispatch lifecycle: srun, sbatch, sync, poll_job, fetch_result
  • Login nodes never run compute: every command goes through bash -lc to load SLURM modules, then srun / sbatch
  • Defaults match typical SciTeX use (Spartan cluster / sapphire partition / 16 cores / 20-minute limit)
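The points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scitex-hpc implementation: the class name JobConfigSketch, its field names, the exact SCITEX_HPC_* variable names, and the helper build_srun_ssh_argv are all hypothetical. Only the srun flags (--partition, --cpus-per-task, --time) and the bash -lc wrapping come from SLURM and the description above.

```python
import os
import shlex
from dataclasses import dataclass, fields


@dataclass
class JobConfigSketch:
    # Hypothetical field names; defaults mirror the ones quoted in the PR.
    host: str = "spartan"
    partition: str = "sapphire"
    cpus: int = 16
    time_min: int = 20

    @classmethod
    def from_env(cls, env=None, prefix="SCITEX_HPC_"):
        """Override defaults from PREFIX + FIELD_NAME variables (assumed naming)."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                # Coerce the string to the type of the field's default value.
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)


def build_srun_ssh_argv(cfg, command):
    """Wrap srun in a login shell so module setup runs before dispatch."""
    srun = (
        f"srun --partition={shlex.quote(cfg.partition)} "
        f"--cpus-per-task={cfg.cpus} --time={cfg.time_min} {command}"
    )
    return ["ssh", cfg.host, f"bash -lc {shlex.quote(srun)}"]


cfg = JobConfigSketch.from_env({"SCITEX_HPC_PARTITION": "cascade"})
argv = build_srun_ssh_argv(cfg, "hostname")
```

Returning an argv list rather than a single shell string keeps the local side free of shell quoting; only the remote command needs shlex.quote.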

Test plan

  • python -c "import scitex.hpc as h; import scitex_hpc as r; assert h is r" — passes
  • All 5 functions + JobConfig present on the alias
  • scitex-hpc test suite: 12/12 pass (mocked subprocess; no live SLURM required)
  • Live SLURM smoke test: srun --partition=cascade hostname, dispatched over SSH via bash -lc 'srun ...', ran on spartan-bm022 (verified before extraction)
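A test that needs no live SLURM can be sketched like this, as an illustration of the mocked-subprocess approach rather than a copy of the actual suite. The function submit and the fake banner text are hypothetical; the "Submitted batch job <id>" line is the message sbatch prints on success.

```python
import re
import subprocess
from unittest import mock


def submit(argv):
    """Run a submission command and extract the job ID from its output."""
    out = subprocess.run(argv, capture_output=True, text=True, check=True).stdout
    match = re.search(r"Submitted batch job (\d+)", out)
    if match is None:
        raise RuntimeError(f"no job ID found in output: {out!r}")
    return int(match.group(1))


def run_mocked_submit():
    """Exercise submit() with subprocess.run replaced by a canned result."""
    fake = subprocess.CompletedProcess(
        args=[],
        returncode=0,
        stdout="Welcome to the cluster\nSubmitted batch job 4242\n",
        stderr="",
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        job_id = submit(["ssh", "host", "bash -lc 'sbatch job.sh'"])
    return job_id, run.call_count
```

Patching subprocess.run at the module attribute works here because submit looks the function up at call time.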

Why

The functions are generic SLURM dispatch code, not specific to dev tooling, so they should be reusable from any consumer rather than only from scitex_dev. The umbrella re-export means from scitex.hpc import srun works in any package.

ywatanabe1989 and others added 10 commits April 27, 2026 22:38
scitex_dev.test_runner's HPC dispatch (run_hpc_srun/sbatch/sync/poll/fetch)
was generic SLURM code that didn't belong in dev tooling. Extracted to
standalone scitex-hpc v0.1.0 package: https://github.com/ywatanabe1989/scitex-hpc

Public API in scitex.hpc:
- JobConfig (dataclass with SCITEX_HPC_* env-var override resolution)
- srun (blocking interactive)
- sbatch (async, returns job ID)
- sync (rsync local → host)
- poll_job (sacct status)
- fetch_result (scp .out file back)

Login nodes never run compute — every command goes through bash -lc to
load SLURM modules, then srun/sbatch.

scitex.hpc is scitex_hpc: True (verified)
scitex-hpc tests: 12/12 pass
… trio

§6c — value-precedence cascade (direct → yaml → env → default) via
scitex_config.PriorityConfig. CLI flags always win; do not hand-roll.
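To make the ordering concrete, here is a small sketch of a direct → yaml → env → default lookup. It only illustrates the precedence; per the note above, real code should use scitex_config.PriorityConfig rather than a hand-rolled helper. The function resolve, its parameters, and the SCITEX_HPC_ prefix as used here are assumptions.

```python
import os


def resolve(key, direct=None, yaml_cfg=None, env=None, default=None,
            env_prefix="SCITEX_HPC_"):
    """Return the first value found, walking direct -> yaml -> env -> default."""
    if direct is not None:          # 1. explicit argument / CLI flag wins
        return direct
    if yaml_cfg and key in yaml_cfg:  # 2. value loaded from a YAML config
        return yaml_cfg[key]
    env = os.environ if env is None else env
    env_key = env_prefix + key.upper()
    if env_key in env:              # 3. environment variable
        return env[env_key]
    return default                  # 4. built-in default


yaml_cfg = {"partition": "cascade"}
env = {"SCITEX_HPC_PARTITION": "sapphire", "SCITEX_HPC_CPUS": "8"}

partition = resolve("partition", yaml_cfg=yaml_cfg, env=env, default="default-q")
cpus = resolve("cpus", yaml_cfg=yaml_cfg, env=env, default="16")
```

In this run the YAML value shadows the environment variable for partition, while cpus falls through to the environment because neither a direct value nor a YAML entry exists.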

§9 — observation/dry-run/execute pattern for mutating commands:
- Mode flags: default observation, --dry-run preview, --<verb>-<scope> execute
- Flag-naming: name action by scope (--update-hosts not --apply)
- --reference names source-of-truth for state-converging ops
- Filter flags use plural scope nouns (--hosts, --packages)
- Dry-run is enforced via manifest gate (canonical: scitex-dev rename-symbols)
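One way the three modes could look in an argparse front end is sketched below. The program name and the mutually exclusive grouping are assumptions made for illustration; the flag names --hosts, --reference, --dry-run, and --update-hosts come from the text above. The manifest gate is not modeled.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="packages-demo")  # hypothetical name
    # Filter flag: plural scope noun.
    parser.add_argument("--hosts", nargs="*", default=[])
    # Source of truth for state-converging operations.
    parser.add_argument("--reference", default=None)
    # Mode flags: neither given means observation only.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true")
    mode.add_argument("--update-hosts", action="store_true")
    return parser


def mode_of(args):
    """Map parsed flags to one of: observe, dry-run, execute."""
    if args.update_hosts:
        return "execute"
    if args.dry_run:
        return "dry-run"
    return "observe"


args = build_parser().parse_args(["--hosts", "a", "b", "--dry-run"])
```

Naming the execute flag after its scope (--update-hosts) means a reader of a shell history can tell what was mutated without consulting the help text.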

Audit checklist updated with the new requirements.
… order

Adds 'Resolution precedence' section listing the cascade
(direct → yaml → env → default) and pointing to the canonical
scitex_config.PriorityConfig implementation. Cross-references
03_interface_02_cli.md §6c.

Augments 06_skills_04_editable-installation.md with a concrete pre-tag
verification step that catches the silent setuptools failure mode where
SKILL.md is on git but absent from the wheel.

Real instance: scitex-hpc 0.6.1 (2026-04-28) — package-data entry was
missing, build succeeded, CI green, but the wheel didn't contain the
SKILL.md. Caught at the unzip-l step before tagging; shipped 0.6.2 the
same day with the fix.

Adds: why setuptools' packages.find doesn't auto-include markdown data,
the 5-second 'unzip -l' pre-tag check with expected output shape, and
the post-install belt-and-suspenders verification.
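The same check that unzip -l performs by eye can be scripted, since a wheel is a zip archive. The sketch below uses only the standard zipfile module; the helper names and the demo archive contents are hypothetical, and matching by filename suffix is an assumption because the expected path of SKILL.md inside the wheel is not stated here.

```python
import io
import zipfile


def missing_from_wheel(wheel, required_suffixes):
    """Return the required suffixes that no archive member ends with."""
    with zipfile.ZipFile(wheel) as zf:
        names = zf.namelist()
    return [s for s in required_suffixes
            if not any(name.endswith(s) for name in names)]


def make_demo_wheel(names):
    """Build an in-memory zip with empty members, standing in for a wheel."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    buf.seek(0)
    return buf


broken = make_demo_wheel(["demo_pkg/__init__.py"])
fixed = make_demo_wheel(["demo_pkg/__init__.py", "demo_pkg/SKILL.md"])
```

For a real build, the first argument would be the path to the file under dist/, and a non-empty result would block tagging.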

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntries

Two general/* leaves existed on disk but weren't linked from SKILL.md
(skill-discovery agents couldn't find them):
  - 04_docs_03_rtd.md (Read the Docs onboarding)
  - 99_quality_03_packaging-bar.md (packaging quality bar)

98_quality_01_failure-playbook.md gains three entries from concrete
incidents during the 2026-04-27/28 multi-tenant scitex-hpc + sac
rollout, all documented with symptom / root cause / fix / where-found:

  §8 a2a-sdk + protobuf 6.x — FieldDescriptor.label AttributeError
     (caught CI red on sac develop; fix: protobuf<6, not <7)
  §9 SLURM cgroup kills tmux spawned by srun --overlap
     (caught Phase 4 prototype; fix: tmux as PID 1 of sbatch script)
  §10 Chatty login-shell banners break SLURM-output parsing
     (caught book() polling forever on Spartan; fix: parse line-by-line
     against a known SLURM-state vocabulary)
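The §10 fix can be illustrated with a short sketch: scan the output line by line and accept only lines whose first token is a known SLURM job state, so banner text is skipped. The function name parse_state and the sample banner are hypothetical, and the vocabulary below is a subset of SLURM's job states, not an exhaustive list.

```python
# Subset of SLURM job states as reported by sacct (not exhaustive).
SLURM_STATES = frozenset({
    "PENDING", "RUNNING", "SUSPENDED", "COMPLETING", "COMPLETED",
    "CANCELLED", "FAILED", "TIMEOUT", "PREEMPTED", "NODE_FAIL",
    "OUT_OF_MEMORY", "BOOT_FAIL", "DEADLINE", "REQUEUED",
})


def parse_state(output):
    """Return the first SLURM state found in output, or None if absent."""
    for line in output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # sacct may print "CANCELLED+" or "CANCELLED by <uid>".
        state = tokens[0].rstrip("+")
        rest = tokens[1:]
        if state in SLURM_STATES and (not rest or rest[0] == "by"):
            return state
    return None


noisy = "Welcome to the cluster!\n*** maintenance on Friday ***\nCOMPLETED\n"
state = parse_state(noisy)
```

Returning None when no state line is present lets the caller distinguish "no answer yet" from a real state, instead of polling forever on a banner line.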

Future agents hitting any of these symptoms now have the playbook entry
to point them straight at the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- 05_version-control_02_release-automation.md: replace deprecated
  sync-remote / fix-mismatches with the unified ecosystem packages
  command (--hosts, --packages, --dry-run, --update-hosts)
- 99_quality_03_packaging-bar.md: add §5a wheel-vs-source data-file
  audit (scitex_dev.audit_package_data) catching SKILL.md and other
  data-file drift before publish